We introduce LongMemEval, a comprehensive, challenging, and scalable benchmark for testing the long-term memory of chat assistants.
We meticulously create 500 questions of seven types (see examples above) to test five core long-term memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
The following figure showcases the question distribution, the number of sessions required to find the answer, and the location of the evidence statements inside sessions.
Inspired by the "needle-in-a-haystack" test, we design an attribute-controlled pipeline to compile a coherent, extensible, and timestamped chat history for each question. Two standard test sets are created: LongMemEvalS, where each question is paired with a compact chat history of roughly 115k tokens, and LongMemEvalM, which scales each history to around 500 sessions.
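To make the history compilation concrete, here is a minimal Python sketch of the interleaving-and-timestamping step, assuming evidence sessions and distractor sessions are already given. The function and field names are hypothetical and only illustrate the idea; they are not the actual attribute-controlled pipeline.

```python
from datetime import datetime, timedelta
import random

def compile_haystack(evidence_sessions, distractor_sessions, start=datetime(2023, 1, 1)):
    """Mix the sessions containing the answer ("needles") with unrelated distractor
    sessions, then assign monotonically increasing timestamps so the resulting
    chat history reads as one coherent, temporally ordered timeline."""
    history = evidence_sessions + distractor_sessions
    random.shuffle(history)  # the evidence may land anywhere in the haystack
    timestamped, current = [], start
    for session in history:
        current += timedelta(hours=random.randint(1, 48))  # arbitrary gap between sessions
        timestamped.append({"timestamp": current.isoformat(), "turns": session})
    return timestamped
```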
Surprisingly, we find that long-context LLMs show a 30%∼60% performance drop on LongMemEvalS, and manual evaluation reveals that state-of-the-art commercial systems (such as GPT-4o) achieve only 30%∼70% accuracy in a setting much simpler than LongMemEvalS. Even the most capable long-context LLMs available today require an effective memory mechanism to manage an ever-growing interaction history.
Finally, we formulate a three-stage long-term memory model for chat assistants. Despite its simplicity, this model provides a unified view of existing work on long-term memory for chat assistants and enables us to investigate four crucial control points across its stages.
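Concretely, the three stages amount to indexing the history into memory entries, retrieving the entries relevant to a new question, and reading them to produce an answer. The minimal sketch below illustrates this loop; all class and method names are illustrative placeholders, not the paper's implementation.

```python
class ChatMemory:
    """Minimal sketch of a three-stage long-term memory loop.
    All names here are hypothetical placeholders, not the paper's API."""

    def __init__(self, retriever, llm):
        self.entries = []            # stored (key, value) memory entries
        self.retriever = retriever   # any lexical or dense retriever
        self.llm = llm               # any chat LLM, called as a function

    def index(self, session):
        # Stage 1 (indexing): choose the storage granularity (session, round,
        # or fact) and decide what text serves as each entry's retrieval key.
        for round_ in session["rounds"]:
            self.entries.append({"key": round_["text"], "value": round_})

    def retrieve(self, question, k=10):
        # Stage 2 (retrieval): recall the top-k entries relevant to the question.
        return self.retriever.search(question, self.entries, k=k)

    def read(self, question, retrieved):
        # Stage 3 (reading): answer the question from the retrieved entries.
        context = "\n\n".join(e["value"]["text"] for e in retrieved)
        return self.llm(f"Chat history excerpts:\n{context}\n\nQuestion: {question}")
```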
[Finding 1] Rounds, rather than full sessions, are the best granularity for storing and utilizing the interaction history. While further compressing rounds into individual user facts harms overall performance due to information loss, it improves multi-session reasoning performance.
[Finding 2] While a flat index that uses the memory values themselves as keys is a strong baseline, expanding the keys with extracted user facts substantially improves both memory recall (4% higher recall@k) and downstream question answering (5% higher accuracy); see the sketch after the findings.
[Finding 3] Simplistic memory designs perform poorly on temporal reasoning questions. We propose a simple time-aware indexing and query expansion strategy that narrows the search range and improves memory recall on temporal reasoning questions by 7%∼11%.
[Finding 4] Even with perfect memory recall, accurately reading the retrieved items remains non-trivial. Applying Chain-of-Note and a structured JSON prompt format improves reading accuracy by as much as 10 absolute points across three LLMs.
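As a concrete illustration of Findings 2 and 3, the sketch below expands each round-level key with extracted user facts and restricts retrieval to a time window inferred from the question. The helper functions (fact extraction, window inference) and field names are hypothetical placeholders under assumed data formats, not the paper's implementation.

```python
def build_index(rounds, extract_facts):
    # Finding 2 (key expansion): store each round as the value, but let its key
    # also contain user facts extracted from that round (e.g. via an LLM call),
    # so that paraphrased questions still match the right memory entry.
    index = []
    for r in rounds:  # each r is assumed to carry "text" and a datetime.date "date"
        facts = extract_facts(r["text"])
        index.append({"key": r["text"] + " " + " ".join(facts), "value": r, "date": r["date"]})
    return index


def time_aware_search(index, query, retriever, window=None, k=10):
    # Finding 3 (time-aware retrieval): if the question implies a time range
    # (e.g. "last May"), drop entries outside that range before ranking.
    candidates = index
    if window is not None:
        lo, hi = window  # (start_date, end_date) inferred from the query
        candidates = [e for e in index if lo <= e["date"] <= hi]
    return retriever.rank(query, candidates)[:k]
```

The query-expansion half of Finding 3 would correspondingly append the inferred dates to the query string before ranking, so that both the candidate set and the query itself carry the temporal constraint.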
@misc{wu2024longmemeval,
      title={LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
      author={Di Wu and Hongwei Wang and Wenhao Yu and Yuwei Zhang and Kai-Wei Chang and Dong Yu},
      year={2024},
      eprint={2410.10813},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.10813},
}